We discuss a new approach to data clustering. We find that maximum likelihood leads naturally to
a Hamiltonian of Potts variables which depends on the correlation matrix and whose low-temperature
behavior describes the correlation structure of the data. For random, uncorrelated data sets
no correlation structure emerges. On the other hand, for data sets with a built-in cluster structure,
the method is able to detect and recover that structure efficiently. Finally, we apply the method to
financial time series, where the low-temperature behavior reveals a non-trivial clustering.



I. INTRODUCTION 

Statistical mechanics typically addresses the question of how structures and order arising
from interactions in extended systems are dressed, and eventually destroyed, by stochastic
(so-called thermal) fluctuations. The inverse problem, unraveling the structure of correlations
from stochastic fluctuations in large data sets, has only recently been addressed using ideas
of statistical mechanics [1,2]. This is the case of data clustering problems, where the goal is
to classify $N$ objects, defined by $D$-dimensional vectors $\{\vec\xi_i\}_{i=1}^N$, in
equivalence classes. The general idea consists in postulating a cost function that measures how
a possible data structure compares with the sample one is studying. This cost function can be
considered as a Hamiltonian whose low energy states correspond to the cluster structures which
are most compatible with the data sample. Structures are identified by configurations
$\mathcal{S} = \{s_i\}_{i=1}^N$ of class indices, where $s_i$ is the equivalence class to which
object $i$ belongs. Regarding the $s_i$ as Potts spins, a Potts Hamiltonian
$H_q = -\sum_{i<j} J_{ij}\,\delta_{s_i,s_j}$ has been recently proposed as a cost function, with
couplings $J_{ij}$ decreasing with the distance $d_{ij} = \|\vec\xi_i - \vec\xi_j\|$ between
objects $i$ and $j$. The underlying structure of data sets emerges as the clustering of Potts
variables at low temperatures.

In the present work we address the question of data clustering. Rather than postulating the
form of the Hamiltonian, we start from a statistical Ansatz and invoke maximum likelihood and
maximum entropy principles. In this way, the structure of the Hamiltonian arises naturally from
the statistical Ansatz, without the need for assumptions on its form. The method is particularly
suited to the study of high dimensional data sets, where each object is characterized by a large
number $D \gg 1$ of properties. Time series are an ideal example of high dimensional objects.
The study of the structure of correlations between time series is therefore a crucial benchmark
for our method.

First we derive the form of the Hamiltonian in the general case. Then we study the thermal and
the ground state properties of this Hamiltonian by Monte Carlo methods, in three different
cases: 1) a synthetic uncorrelated data set, 2) a synthetic data set with a known correlation
structure and finally 3) a data set composed of financial time series with unknown correlations.
We find that 1) for random uncorrelated time series no persistent structure emerges at low
temperatures; 2) if the time series are generated with some cluster structure $\mathcal{S}^*$,
we find a phase transition to a low temperature phase which is dominated by cluster
configurations close to $\mathcal{S}^*$. Hence the method does not introduce spurious
correlations and is able to recover known correlation structures. The nature of the transition
is investigated by a mean field calculation in a simple case. This reveals that the phase
transition is of first order.

The financial time series that we will study consist of the returns of the assets composing the
S&P500 index, whose correlations have been the subject of much recent interest. On one side it
has been observed that the S&P500 correlation matrix is affected by considerable noise-dressing.
Indeed its spectral properties are close to those of random, uncorrelated time series. On the
other hand, these same correlation matrices have revealed a non-trivial structure of
correlations when analyzed by minimal spanning tree methods and by super-paramagnetic
clustering. These apparently contradictory results raise the issue of disentangling in a
systematic way the effects of fluctuations from real correlations in a large but finite data
set. This is the main issue we shall focus on here.

Quite interestingly, our analysis of the S&P500 data set reveals a low temperature behavior
dominated by a few clusters of correlated assets with scale invariant properties. We shall not
enter into the details of the economic meaning of our findings, which shall be discussed
elsewhere [12]. Our aim is rather to address the problem of revealing the structure of bare
correlations hidden in a finite data set. We show that a thermal average over the relevant
cluster structures provides a good fit of the financial correlations, which allows us to
estimate the noise-undressed correlation matrix. Finally, we discuss several generalizations of
our approach to generic data clustering.
II. THE METHOD

Let the data set $\Xi = \{\vec\xi_i\}_{i=1}^N$ be composed of $N$ sets
$\vec\xi_i = \{\xi_i(d)\}_{d=1}^D$ of $D$ measures. These are normalized to zero mean,
$\sum_d \xi_i(d)/D = 0$, and unit variance, $\sum_d \xi_i^2(d)/D = 1$. For example, in our
application below $\xi_i(d)$ is the normalized daily return of asset $i$ of the S&P500 index
on day $d$. The data set can also refer to a set of $N$ objects which are characterized by $D$
measured quantities. In this case $\xi_i(d)$ is the "normalized" value of property $d$ for
object $i$. We assume that the $\xi_i(d)$ are Gaussian variables. The reason is that we want to
focus exclusively on pairwise correlations, and the Gaussian model is the only one which is
completely specified at this level. We shall discuss later how deviations from Gaussian
statistics can be accounted for. The key quantity of interest is the matrix

$$C_{ij} = \frac{1}{D}\sum_{d=1}^{D} \xi_i(d)\,\xi_j(d). \qquad (1)$$
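The normalization and the correlation matrix $C_{ij}$ can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the function names `normalize` and `correlation_matrix` are hypothetical, the data are given as a list of equally long rows, and the variance is the population variance (division by $D$), as in the normalization stated above.

```python
import math

def normalize(series):
    """Rescale one series to zero mean and unit (population, 1/D) variance."""
    D = len(series)
    mean = sum(series) / D
    var = sum((x - mean) ** 2 for x in series) / D
    if var == 0.0:
        raise ValueError("a constant series cannot be normalized")
    std = math.sqrt(var)
    return [(x - mean) / std for x in series]

def correlation_matrix(data):
    """Return C[i][j] = (1/D) * sum_d xi_i(d) * xi_j(d) for the normalized rows."""
    xi = [normalize(row) for row in data]
    N, D = len(xi), len(xi[0])
    return [[sum(xi[i][d] * xi[j][d] for d in range(D)) / D for j in range(N)]
            for i in range(N)]
```

With this normalization the diagonal elements are equal to 1 by construction, and two rows that differ only by a positive rescaling have $C_{ij} = 1$.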
In order to investigate the structure of correlations, let us assume that the $\xi_i(d)$ were
generated by the equation

$$\xi_i(d) = \frac{\sqrt{g_{s_i}}\,\eta_{s_i}(d) + \epsilon_i(d)}{\sqrt{1+g_{s_i}}}. \qquad (2)$$

Here $g_s > 0$, the $s_i$ are integer variables (so-called Potts spins), and $\eta_s(d)$ and
$\epsilon_i(d)$ are i.i.d. Gaussian variables with zero average and unit variance.
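A synthetic data set of this kind can be generated with a short sketch such as the following. The function name `generate_series`, the dictionary `g` mapping each cluster index to its coupling, and the use of Python's `random.Random` generator are illustrative assumptions, not part of the original method.

```python
import math
import random

def generate_series(labels, g, D, seed=0):
    """Draw D observations per set from the cluster Ansatz of Eq. (2).

    labels[i] is the cluster index s_i; g[s] >= 0 is the coupling of cluster s.
    """
    rng = random.Random(seed)
    clusters = sorted(set(labels))
    data = [[0.0] * D for _ in labels]
    for d in range(D):
        # one common Gaussian factor eta_s(d) per cluster
        eta = {s: rng.gauss(0.0, 1.0) for s in clusters}
        for i, s in enumerate(labels):
            eps = rng.gauss(0.0, 1.0)  # idiosyncratic noise of set i
            data[i][d] = (math.sqrt(g[s]) * eta[s] + eps) / math.sqrt(1.0 + g[s])
    return data
```

For example, `generate_series([0, 0, 1], {0: 3.0, 1: 0.0}, D=4000)` produces two sets whose correlation is close to $g/(1+g) = 0.75$ and a third set that is independent of both.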

The Ansatz of Eq. (2) was proposed by Noh to explain the spectral properties found in earlier
work. The idea behind it is that each set $i$ belongs to one cluster $s_i$ and that sets $i$
and $j$ in the same cluster ($s_i = s_j = s$) are correlated ($C_{ij} \approx g_s/(1+g_s)$),
whereas sets in different clusters ($s_i \neq s_j$) are independent. The $s$-th cluster is
composed of $n_s$ sets with internal correlation $c_s$, where

$$n_s = \sum_{i=1}^{N} \delta_{s_i,s}, \qquad c_s = \sum_{i,j=1}^{N} C_{ij}\,\delta_{s_i,s}\,\delta_{s_j,s}. \qquad (3)$$

In order to allow for totally uncorrelated sets, we let $s_i$ take all integer values up to
$N$. Hence $\mathcal{S} = \{s_i\}_{i=1}^N$ describes the structure of correlations, whereas the
parameters $\mathcal{G} = \{g_s\}_{s=1}^N$ tune the strength of these correlations.

Note that this Ansatz is different from the explicative factor model used in financial
applications, which is discussed in the Appendix.

The correlation matrix generated by Eq. (2) for $D \to \infty$ is

$$C_{ij} = \frac{g_{s_i}\,\delta_{s_i,s_j} + \delta_{ij}}{1+g_{s_i}}. \qquad (4)$$

Its distribution of eigenvalues is simple: to each $s$ with $n_s > 1$ there corresponds one
eigenvalue

$$\lambda_{s,0} = \frac{1+g_s n_s}{1+g_s}$$

and $n_s - 1$ eigenvalues $\lambda_{s,1} = 1/(1+g_s)$. Hence, large eigenvalues correspond to
groups of many ($n_s \gg 1$) sets. For $D$ finite, we expect noise to lift degeneracies between
the $\lambda_{s,1}$ but to leave the structure of large eigenvalues unchanged.
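The two eigenvalues quoted above can be checked numerically on a single cluster block. The sketch below uses hypothetical helper names (`cluster_block`, `rayleigh`) and illustrative values $n = 5$, $g = 2$; it evaluates the Rayleigh quotient on the uniform vector and on a vector orthogonal to it, which are eigenvectors of the block.

```python
def cluster_block(n, g):
    """D -> infinity correlation matrix of one cluster of n sets with coupling g."""
    return [[(g + (1.0 if i == j else 0.0)) / (1.0 + g) for j in range(n)]
            for i in range(n)]

def rayleigh(C, v):
    """Return v.C.v / v.v, which equals the eigenvalue when v is an eigenvector."""
    Cv = [sum(row[j] * v[j] for j in range(len(v))) for row in C]
    return sum(a * b for a, b in zip(v, Cv)) / sum(a * a for a in v)

n, g = 5, 2.0
C = cluster_block(n, g)
uniform = [1.0] * n                        # collective mode of the cluster
contrast = [1.0, -1.0] + [0.0] * (n - 2)   # one of the n - 1 modes orthogonal to it
lam_large = rayleigh(C, uniform)           # expected (1 + g*n) / (1 + g)
lam_small = rayleigh(C, contrast)          # expected 1 / (1 + g)
```

The sum of one large eigenvalue and $n - 1$ small ones equals $n$, the trace of the block, as it should.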

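In order to fit the data set $\Xi$ with Eq. (2), let us compute the likelihood. This is the
probability $P(\Xi|\mathcal{S},\mathcal{G})$ of observing the data $\Xi$ as a realization of
Eq. (2) with structure $\mathcal{S}$ and parameters $\mathcal{G} = \{g_s\}_{s=1}^N$, and it
reads

$$P(\Xi|\mathcal{S},\mathcal{G}) = \prod_{d=1}^{D}\left\langle \prod_{i=1}^{N} \delta\!\left(\xi_i(d) - \frac{\sqrt{g_{s_i}}\,\eta_{s_i}(d)+\epsilon_i(d)}{\sqrt{1+g_{s_i}}}\right)\right\rangle, \qquad (5)$$

where the average is over all the $\eta$ and $\epsilon$ variables and $\delta(x)$ is Dirac's
delta function. Gaussian integration and elementary algebra lead to

$$P(\Xi|\mathcal{S},\mathcal{G}) \propto \exp\left\{-\frac{D}{2}\sum_{s:\,n_s>0}\left[(1+g_s)\left(n_s - \frac{g_s c_s}{1+g_s n_s}\right) - n_s\ln(1+g_s) + \ln(1+g_s n_s)\right]\right\}. \qquad (6)$$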
For any given structure $\mathcal{S}$ and $D \gg 1$, the likelihood
$P(\Xi|\mathcal{S},\mathcal{G})$ is maximal for $g_s = \hat g_s$, where

$$\hat g_s = \frac{c_s - n_s}{n_s^2 - c_s} \qquad (7)$$

for $n_s > 1$ and $\hat g_s = 0$ for $n_s \le 1$. Inverting Eq. (7) gives
$c_s = (\hat g_s n_s^2 + n_s)/(\hat g_s + 1)$, which is exactly what one would get combining
Eqs. (3,4). Hence the maximum likelihood estimators $\hat g_s$ are consistent with our
Ansatz (2).



Note that for uncorrelated sets, $C_{ij} = \delta_{ij}$, we have $c_s = n_s$ for each $s$ and
hence $\hat g_s = 0$. The coupling strength $\hat g_s$ instead diverges for totally correlated
sets ($C_{ij} = 1$) because $c_s = n_s^2$.
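The estimator and its two limits can be written down directly. In the sketch below the function names `g_hat` and `c_from_g` are illustrative; setting the coupling to zero whenever $c_s \le n_s$ is an assumption made here to keep $g_s \ge 0$, since the text only specifies $\hat g_s = 0$ for $n_s \le 1$.

```python
def g_hat(n_s, c_s):
    """Maximum likelihood coupling of a cluster: (c_s - n_s) / (n_s**2 - c_s)."""
    if n_s <= 1 or c_s <= n_s:
        return 0.0            # singleton, or no net positive correlation (assumed clipping)
    if c_s >= n_s ** 2:
        return float("inf")   # totally correlated cluster
    return (c_s - n_s) / (n_s ** 2 - c_s)

def c_from_g(n_s, g_s):
    """Inverse relation: c_s = (g_s * n_s**2 + n_s) / (g_s + 1)."""
    return (g_s * n_s ** 2 + n_s) / (g_s + 1.0)
```

For instance, a cluster of $n_s = 6$ sets with $g_s = 1.5$ has $c_s = 24$, and `g_hat(6, 24)` returns 1.5 again.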

An expansion to second order in $g_s - \hat g_s$ of Eq. (6) shows that the likelihood quickly
vanishes for $|g_s - \hat g_s| \gg 1/\sqrt{D}$. Hence, for $D \gg 1$, we can simplify things
considerably by setting $g_s = \hat g_s$ in Eq. (6). The likelihood of structure $\mathcal{S}$
under Ansatz (2) then takes the form $P(\Xi|\mathcal{S}) \propto e^{-D H_c}$, where

$$H_c\{\mathcal{S}\} = \frac{1}{2}\sum_{s:\,n_s>1}\left[\log\frac{c_s}{n_s} + (n_s-1)\log\frac{n_s^2-c_s}{n_s^2-n_s}\right]. \qquad (8)$$
The ground state $\mathcal{S}_0$ of $H_c$ yields the maximum likelihood fit with Eq. (2). This
would probably take the Ansatz (2) too seriously. In general, it is preferable to consider
probabilistic solutions $P\{\mathcal{S}\}$ and, following earlier work, we invoke the maximum
entropy principle: among all distributions $P\{\mathcal{S}\}$ with the same average
log-likelihood, we select that which has maximal entropy. This, as usual, leads to the Gibbs
distribution $P\{\mathcal{S}\} \propto e^{-\beta H_c\{\mathcal{S}\}}$, where the inverse
temperature $\beta$ arises as a Lagrange multiplier.

The Hamiltonian $H_c$ depends implicitly on the Potts spins $s_i$ through the cluster variables
$n_s$ and $c_s$ of Eq. (3). Unlike the Potts Hamiltonian $H_q$, the dependence on
$\delta_{s_i,s_j}$ is non-linear and it is modulated by $C_{ij}$. For $s_i \neq s_j$ for all
$i \neq j$ we have $n_s = c_s = 1$ for all $s$ and hence $H_c = 0$. This state is
representative of the high temperature ($\beta \to 0$) limit. The low temperature physics of
$H_c$ is instead non-trivially related to the correlation matrix $C_{ij}$. Note that the
ferromagnetic state $s_i = 1$ for each $i$, which dominates as $\beta \to \infty$ in clustering
methods based on Potts models, is in general not the ground state of $H_c$. Intuitively we
expect that, if the model of Eq. (2) is reasonable, $H_c$ should have a well defined ground
state and a low temperature phase which is energetically dominated by this state. In these
cases, as in ref. [2], we expect a thermal phase transition [9].
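To make the dependence of $H_c$ on the cluster variables concrete, the following sketch evaluates Eq. (8) for a given assignment of labels. The names `cluster_stats` and `h_c` are illustrative. As an assumption of this sketch, clusters with $c_s \le n_s$ contribute zero energy (corresponding to $g_s = 0$), and a totally correlated cluster ($c_s \ge n_s^2$) returns minus infinity.

```python
import math

def cluster_stats(labels, C):
    """Return {s: (n_s, c_s)}: size and summed internal correlation of each cluster."""
    members = {}
    for i, s in enumerate(labels):
        members.setdefault(s, []).append(i)
    return {s: (len(idx), sum(C[i][j] for i in idx for j in idx))
            for s, idx in members.items()}

def h_c(labels, C):
    """Energy H_c of Eq. (8) for the cluster structure encoded in labels."""
    energy = 0.0
    for n, c in cluster_stats(labels, C).values():
        if n <= 1 or c <= n:
            continue                   # assumed: no net positive correlation, g_s = 0
        if c >= n * n:
            return float("-inf")       # totally correlated cluster, likelihood diverges
        energy += 0.5 * (math.log(c / n)
                         + (n - 1) * math.log((n * n - c) / (n * n - n)))
    return energy
```

On a toy matrix with two blocks of two sets each and in-block correlation 0.5, the all-singleton structure has $H_c = 0$, the two-cluster structure has $H_c = \log 0.75 \approx -0.288$, and the ferromagnetic state lies in between, which illustrates that the latter is not the ground state.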

The form of the Hamiltonian $H_c$ clearly depends on the Ansatz (2). For example, if one takes
a factor model for the correlations one finds a different Hamiltonian which depends on
different variables, as discussed in the Appendix. Also note that the present model only
describes positive correlations. It is straightforward to generalize this approach to the case
where a sizeable number of matrix elements $C_{ij}$ are negative and not small. The idea is to
introduce spin variables $\sigma_i = \pm 1$ for each set and modify Eq. (2) by multiplying the
right hand side by $\sigma_i$. This leads us to the analysis of a system where the Potts
variables $s_i$ and the spin variables $\sigma_i$ interact. An account of this method shall be
given elsewhere [12].

III. THE DATA

We consider three different data sets, i.e. three different matrices $C_{ij}$. For all of them
we took $N = 443$ time series of length $D = 1599$. The results with shorter time series will
also be discussed below.

The first data set refers to $N$ totally uncorrelated time series of length $D$. This is
obtained, for example, from Eq. (2) with $s_i = i$. The second is also obtained from Eq. (2),
but this time with preassigned structure $\mathcal{S}^*$ and coupling strengths
$\mathcal{G}^*$. We shall discuss below how the specific structure was chosen. These two data
sets help us to understand how the method performs when no structure at all is present and to
check whether a predefined structure can be recovered.

Our third data set is made of financial time series of asset prices relative to the Standard &
Poor's 500 index (S&P500). More precisely, $\xi_i(d)$ is the normalized daily return of asset
$i$ of the S&P500 index on day $d$; this is defined as

$$\xi_i(d) = \frac{\log[p_i(d)/p_i(d-1)] - r_i}{\sigma_i},$$

where $p_i(d)$ is the price of asset $i$ on day $d$. The parameters $r_i$ and $\sigma_i$ are
determined in order to have zero mean and unit variance for all $i$.

Correlation matrices of financial time series are of great practical interest. Indeed they are
at the basis of risk minimization in modern portfolio theory. This states that, in order to
reduce risk, the investment needs to be diversified (i.e. divided) over many uncorrelated
assets, so that price fluctuations are averaged out. However, the measure of correlation in
samples with a number of observation times comparable to the number of assets was recently
found to be affected by considerable noise-dressing. For example, the S&P500 is composed of
$N = 500$ assets and, considering daily data from July 3rd 1989 to October 27th 1995, one has
$D = 1600$ data points for each asset. This data set is then an ideal instance of a problem
where our method applies.

FIG. 1. Distribution of eigenvalues of the correlation matrices of S&P500 (full line •),
random (dotted ×) and synthetic correlated (dashed ○) time series.


In addition, this data set has been studied by other authors with several methods including
spectral analysis, minimal spanning tree and super-paramagnetic clustering. This allows us to
compare the results of our method with those of other methods. Finally, in order to better
appreciate the performance of our method in a real application, we choose the synthetic
correlated data set $\{\mathcal{S}^*, \mathcal{G}^*\}$ as a "large likelihood" structure of the
S&P500 data set. In other words, we performed a simulated annealing experiment on the S&P500
data set, where the fictitious temperature $1/\beta$ was gradually decreased to 0. This allowed
us to compare how well the real S&P500 data set can be described by a maximum likelihood
structure^a.

Figure 1 shows a comparison of the spectral properties of the three correlation matrices. The
spectrum of eigenvalues for uncorrelated time series is known exactly. It extends over an
interval of size $\sim\sqrt{N/D}$ around $\lambda = 1$, as shown in Fig. 1. The spectrum of
eigenvalues of the S&P500 correlation matrix has a similar shape for $\lambda \approx 1$. This
suggests that significant noise-dressing due to finite $D$ occurs. The tail of the distribution
($\lambda \gg 1$) implies that some correlation is however present. Within our framework, large
eigenvalues are associated with large clusters. Indeed the synthetic correlated data set has a
broad distribution of cluster sizes (see Fig. 3) and a correspondingly fat tail in the
distribution of eigenvalues.

^a The maximum likelihood structure may be computationally hard to find and simulated
annealing may get trapped in a local minimum. Indeed in our case we found slightly different
structures in different experiments. The structure $\mathcal{S}^*$ was that with the largest
likelihood.

IV. CLUSTERING BY MONTE CARLO 
SIMULATIONS 

In order to study the properties of $H_c$ we resort to the Monte Carlo (MC) method with the
Metropolis algorithm [10]. This, at equilibration, allows us to sample the Gibbs distribution
$P\{\mathcal{S}\}$ and compute average quantities, such as the internal energy
$E_\beta = \langle H_c \rangle_\beta$, where $\langle\cdots\rangle_\beta$ stands for the thermal
average. To detect the occurrence of spontaneous magnetization, which corresponds to the $s_i$
remaining locked into energetically favorable configurations at low temperature, we measure the
autocorrelation function

$$\chi(t,\tau) = \frac{\sum_{i<j} \delta_{s_i(t),s_j(t)}\,\delta_{s_i(t+\tau),s_j(t+\tau)}}{\sum_{i<j} \delta_{s_i(t),s_j(t)}}. \qquad (10)$$

This quantity tells us what fraction of the pairs of sites belonging to the same cluster at
time $t$ are still found in the same cluster after $\tau$ MC steps. For $t$ large enough,
$\chi$ becomes a function of $\tau$ only. This function decreases rapidly to a plateau value
$\chi_\beta = \langle \chi(t,\tau) \rangle_\beta$ for $t \gg \tau \gg 1$. Clearly
$\chi_\beta = 0$ implies that no persistent structure is present whereas, at the other extreme,
$\chi_\beta = 1$ implies that all sites are locked in a persistent structure of clusters.
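The autocorrelation of Eq. (10) compares two snapshots of the label configuration. A minimal sketch follows; the function name `pair_persistence` is illustrative, and returning 0 when the first snapshot contains no same-cluster pair is a convention assumed here.

```python
def pair_persistence(labels_t, labels_later):
    """Fraction of same-cluster pairs in labels_t that are still together in labels_later."""
    N = len(labels_t)
    bonded = kept = 0
    for i in range(N):
        for j in range(i + 1, N):
            if labels_t[i] == labels_t[j]:
                bonded += 1                       # pair (i, j) shares a cluster at time t
                if labels_later[i] == labels_later[j]:
                    kept += 1                     # ... and still does at the later time
    return kept / bonded if bonded else 0.0
```

Because only the equality of labels within a pair matters, the measure is insensitive to a relabeling of the clusters. The same function, applied to the current configuration and to a fixed reference structure, gives an overlap of the kind used to compare with a known clustering.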

We monitored these quantities for the three data sets. Let us start with a truly uncorrelated
time series. We generate the time series and then compute $C_{ij}$ by Eq. (1). With this we
compute the Hamiltonian $H_c$ and study its thermal properties by the MC method. We do not
expect any clustering to emerge in this case. Indeed, the internal energy $E_\beta$ stays very
close to 0 (see Fig. 2a) for all values of $\beta$ investigated up to $\beta = 512$.
Correspondingly no persistent cluster arises, i.e. $\chi_\beta = 0$.
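The sampling procedure can be sketched with single-spin Metropolis moves. This is an illustrative and deliberately inefficient version, not the implementation used for the results reported here: the names `h_c` and `metropolis` are hypothetical, the energy is recomputed from scratch at every proposal, the acceptance rule assumes the Gibbs weight $e^{-\beta H_c}$, and clusters with $c_s \le n_s$ or $c_s \ge n_s^2$ are assumed to contribute zero energy.

```python
import math
import random

def h_c(labels, C):
    """H_c of Eq. (8); clusters outside n_s < c_s < n_s**2 are ignored in this sketch."""
    members = {}
    for i, s in enumerate(labels):
        members.setdefault(s, []).append(i)
    energy = 0.0
    for idx in members.values():
        n = len(idx)
        c = sum(C[i][j] for i in idx for j in idx)
        if n > 1 and n < c < n * n:
            energy += 0.5 * (math.log(c / n)
                             + (n - 1) * math.log((n * n - c) / (n * n - n)))
    return energy

def metropolis(C, beta, sweeps, seed=0):
    """Sample cluster labels with single-spin Metropolis moves at inverse temperature beta."""
    rng = random.Random(seed)
    N = len(C)
    labels = list(range(N))              # start from all singletons, where H_c = 0
    energy = h_c(labels, C)
    for _ in range(sweeps * N):
        i = rng.randrange(N)
        old, new = labels[i], rng.randrange(N)   # each s_i may take N different values
        if new == old:
            continue
        labels[i] = new
        trial = h_c(labels, C)
        delta = trial - energy
        if delta <= 0.0 or rng.random() < math.exp(-beta * delta):
            energy = trial               # accept the move
        else:
            labels[i] = old              # reject the move and restore the old label
    return labels, energy
```

A practical implementation would update $n_s$ and $c_s$ incrementally for the two clusters affected by a move instead of recomputing the full energy, and would discard an initial equilibration period before measuring averages.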

The results change when turning to correlated data. Let us first discuss the S&P500 data (for
$D = 1599$). As Fig. 2a shows, for $\beta \approx 20$ the energy $E_\beta$ starts deviating
significantly from zero. For $\beta > 20$ persistent clusters are present: $\chi_\beta$ rapidly
rises from zero and has a maximum at $\beta \approx 40$ (see Fig. 2b). The energy fluctuations
reported in the inset show a broad peak marking the onset of an ordered low temperature phase.
As $\beta$ increases the dynamics is significantly slowed down. At $\beta \approx 200$ the
energy reaches a minimal value $E_\beta \approx -0.11N$ and does not decrease significantly on
increasing $\beta$, at least up to $\beta = 4095$. This energy is smaller than that of the
ferromagnetic state ($E_F = -0.086N$), with all sets in the same cluster. In this range of
temperatures the system visits only a few configurations.
FIG. 2. a) Energy $E_\beta$ as a function of $\beta$ for random (×), S&P500 (+) and correlated
(□) data sets of length $D = 1599$. The results for the S&P500 data set over the last
$D = 400$ days are also shown (•). Inset: square energy fluctuation $\delta E_\beta$ vs
$\beta$ for the same data sets (same symbols). b) Autocorrelation $\chi_\beta$ as a function of
$\beta$ for the same data sets (same symbols). The full (dashed) line refers to the overlap
with the configuration $\mathcal{S}^*$ for the S&P500 (correlated) data set with $D = 1599$.

FIG. 3. Rank plot of $n_s$ for several values of $\beta$. The line corresponds to
$n \sim \mathrm{rank}^{-1.2}$. Inset: $c_s$ versus $n_s$ for $\beta = 256$ (•) and P (□). The
line corresponds to $c \sim n^{1.66}$.

The statistical properties of cluster configurations, as $\beta$ varies, are shown in Fig. 3.
For small $\beta$ only small clusters survive thermal fluctuations. As $\beta$ increases a
distribution of cluster sizes develops. At low temperatures the rank order plot of $n_s$
reveals a broad distribution of clusters, with the largest aggregating more than 190 sets. By a
power law fit of this distribution, we find that the number of clusters with more than $n$ sets
decays as $n^{-0.83}$. The scatter plot of $c_s$ versus $n_s$ also reveals a non-trivial power
law dependence $c_s \sim n_s^{1.66}$. This gives a statistical characterization of the dominant
configurations of clusters at low energy. The cluster structure we obtain is reasonable from
the economic point of view: companies in the same economic sector belong to the same cluster.
These issues will be discussed in detail elsewhere [12]. Here we restrict our attention to the
clustering method.

It is instructive to compare these results with those obtained from shorter time series. We
performed a second set of simulations with $C_{ij}$ computed using the quotes of the S&P500
assets for the last $D = 400$ days. We find two inverse temperatures $\beta_1 \approx 20$ and
$\beta_2 \approx 80$ which separate three regimes. This is signalled by the bending in the
$E_\beta$ curve and by peaks in the $\delta E_\beta$ vs $\beta$ plot. At the first temperature
clusters start to appear. For $\beta < \beta_2$ the largest cluster groups fewer than 30 sets,
and for $\beta > \beta_2$ larger clusters ($n_s \approx 100$) appear. This hints at a time
dependence of correlations, which are averaged in the $D = 1599$ data set. For even shorter
time series we found that sampling errors, acting like a temperature, destroy large clusters,
and only relatively small clusters ($n_s < 40$ for $D = 60$) were found.

Finally let us discuss the results for the synthetic correlated data set (for $D = 1599$). As
already mentioned, the structure $\mathcal{S}^*$ is a typical low energy configuration for the
S&P500 data set extracted from the previous simulations (with $D = 1599$). The parameters
$g_s^*$ were deduced from the $n_s$ and $c_s$ of this configuration, via Eq. (7). We recall
that this data set is useful for two reasons: first, it allows one to understand to what extent
a structure of correlation put by hand, with the form dictated by Eq. (2), can be correctly
recovered. Secondly, it allows us to compare the results found for the S&P500 data with those
of a time series with correlations described by Eq. (2).

For $\beta < 150$, the behaviors of $E_\beta$, $\delta E_\beta$ and $\chi_\beta$ are similar to
those found for the S&P500 data (see Fig. 2). A second, sharp peak in $\delta E_\beta$ at
$\beta \approx 170$ signals a new clustering transition. Below this temperature, as shown by
the plot of $\chi_\beta$ (Fig. 2b), the MC dynamics freezes into the original structure
$\mathcal{S}^*$. The overlap with the configuration $\mathcal{S}^*$, defined as in Eq. (10) as
the fraction of "bonds" $s_i = s_j$ for which $s_i^* = s_j^*$, quickly converges to 1 (see
Fig. 2b) for the synthetic time series, whereas it remains around 60% for the S&P500 data set.
This, on one hand, means that the original structure $\mathcal{S}^*$ can be recovered quite
efficiently. On the other hand, it suggests that several cluster configurations compete at low
temperatures in the S&P500 data set.



V. MEAN FIELD MODEL 

In this section we would like to determine the nature of the clustering transition that takes
place in our system. We apply our method to an unrealistically simple situation that will allow
us to extract analytical expressions for the order parameter associated with the phase
transition. Our analysis is rather similar to that of previous work.

We take $N$ time series that belong to $M$ clusters of the same size $n = N/M$. Let
$\bar s_i$ be the cluster index of the time series $i$, and let

$$C_{ij} = \gamma\,\delta_{\bar s_i,\bar s_j} + (1-\gamma)\,\delta_{ij} \qquad (11)$$

be the correlation matrix for $D \to \infty$. This means that time series with
$\bar s_i = \bar s_j$ have correlation $\gamma$, whereas those with $\bar s_i \neq \bar s_j$
have $C_{ij} = 0$. The matrix has $M$ blocks of size $N/M$ along the diagonal and is zero
elsewhere. A finite $D$ sample of this problem is generated with Eq. (2) with
$g_s = \gamma/(1-\gamma)$ and $s_i = \bar s_i$.

To fix ideas, we can imagine the problem of putting $N$ balls of $M$ different colors in $M$
boxes. Colors represent the original structure $\bar s_i$ whereas boxes represent the actual
clustering configuration $s_i$. When the balls contained in each box have the same color, the
original cluster structure has been recovered. Let $m_s^c$ be the number of balls of color $c$
in box (or cluster) $s$. Now $\sum_s m_s^c = n = N/M$ is the overall number of balls of color
$c$, assumed equal for all colors, and $\sum_c m_s^c = n_s$ is the number of balls in box $s$,
as in Eq. (3). With the above choice of parameters the internal correlation of box $s$ for a
given configuration $\{m_s^c\}$ of clusters is

$$c_s = (1-\gamma)\,n_s + \gamma \sum_c \left(m_s^c\right)^2. \qquad (12)$$



To compute the free energy $F = U - TS$ of the system we use the energy $H_c$ as in Eq. (8) and
we estimate the configuration entropy in the following way: the number of ways in which one can
distribute the balls of color $c$ by putting $m_s^c$ of them in box $s$ is

$$\frac{\left(\sum_s m_s^c\right)!}{\prod_s \left(m_s^c\right)!} = \frac{n!}{\prod_s \left(m_s^c\right)!} \qquad (13)$$

and the total number of configurations for all the colors is the product over $c$ of this
expression. Taking the logarithm of this product we obtain the configuration entropy

$$S = -\sum_{s}\sum_{c} m_s^c \log\frac{m_s^c}{n}, \qquad (14)$$
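where we have approximated in the usual way the logarithm of the factorial. Finally the free
energy is $F = \sum_s F_s$, with

$$F_s = \frac{1}{2}\left[\log\frac{c_s}{n_s} + (n_s-1)\log\frac{n_s^2-c_s}{n_s^2-n_s}\right] + T\sum_c m_s^c \log\frac{m_s^c}{n}. \qquad (15)$$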
After substitution of Eq. (12) we find an expression which depends on the occupation variables
$m_s^c$. The occupations in different boxes are related by the overall constraints
$\sum_s m_s^c = N/M$. We take the mean-field approximation, which is legitimate in cases like
this, where we neglect these effects. In other words, we minimize each of the $F_s$
independently and we therefore suppress the subscript $s$ from now on.
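The quality of the factorial approximation used for the configuration entropy can be checked numerically for one color. The sketch below compares the exact logarithm of the multinomial count of Eq. (13) with the leading-order Stirling estimate $-\sum_s m_s \log(m_s/n)$; this form of the estimate and the function names are assumptions of the illustration.

```python
import math

def log_multinomial(m):
    """Exact log of n! / prod_s m_s! with n = sum(m), computed via log-gamma."""
    n = sum(m)
    return math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in m)

def stirling_entropy(m):
    """Leading-order Stirling estimate: -sum_s m_s * log(m_s / n), empty boxes skipped."""
    n = sum(m)
    return -sum(k * math.log(k / n) for k in m if k > 0)
```

For occupations of a few hundred balls per box the two expressions differ only by terms of order $\log n$, so the relative error is below one percent; it grows for small occupation numbers.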

We can then focus on just one box and look for solutions of the form

$$m^c = \begin{cases} \dfrac{N}{M}\,\phi & c = 1 \\[2mm] \dfrac{N}{M}\,\dfrac{1-\phi}{M-1} & c > 1 \end{cases} \qquad (16)$$

with $0 < \phi < 1$. In this Ansatz, the balls of color $c = 1$ are more (or less) numerous
than those of other colors $c > 1$. In the spirit of the mean-field approximation, we neglect
the possibility that the balls of colors $c > 1$ may be unevenly distributed. Hence the
parameter $\phi$ plays the role of the order parameter.

The paramagnetic solution $\phi = 1/M$, which corresponds to a uniform distribution of colors,
is always a solution of the saddle point equation $\partial F/\partial\phi = 0$. This state is
expected to be stable (the minimum of $F$) at high temperature. A second solution of
$\partial F/\partial\phi = 0$, which corresponds to the clustered "ferromagnetic" state,
appears at intermediate temperatures with $\phi \approx 1$. For $T = T_c$ the values of the
free energy corresponding to the two states are equal and a first order phase transition to a
ferromagnetic state takes place. The order of the transition is independent of the values of
the parameters, while the critical temperature is determined mainly by the number of time
series $N$.

We checked that this result is compatible with that obtained from Monte Carlo simulations. We
find that the mean-field approach provides a good qualitative picture of the transition and a
reliable estimate of the critical temperature at which it takes place (see Fig. 4).

FIG. 4. Ferromagnetic clustering: filled dots and full lines correspond to Monte Carlo and
analytical results for a system with $N = 150$, $M = 6$; empty dots and dashed lines to
$N = 2400$, $M = 24$. $\gamma$ is always 0.3.



VI. NOISE UNDRESSING 

Eq. (2) with a single cluster configuration ($\beta \to \infty$) is inadequate to capture the
full complexity of the correlations in the S&P500 data set. Probabilistic clustering, where
several cluster structures $\mathcal{S}$ are allowed with their Gibbs probability
$P\{\mathcal{S}\}$, provides an alternative approximation. In this approach the parameter
$\beta$ can be tuned to determine the optimal spread in configuration space, which best
describes the correlation structure built into the original data set. This line of reasoning
will lead us to a method to "fit" the correlation structure of a data set with a single
parameter $\beta$. This will finally allow us to undress the correlation matrix $C_{ij}(D)$ of
its noise-dressing and to reveal the bare correlations.

Let us start by remarking that the problem with Eq. (2) is that it stipulates that a set $i$
can belong to only one cluster. This suggests considering the generalized model

$$\xi_i(d) = \frac{\sum_s \sqrt{g_{s,i}}\,\eta_s(d) + \epsilon_i(d)}{\sqrt{1+\sum_s g_{s,i}}}, \qquad (17)$$

where each set $i$ can belong to any cluster $s$. Eq. (17) has, on the other hand, the
disadvantage that it depends on too many variables, and it leads to overfitting stochastic
fluctuations.

The finite temperature distribution $P\{\mathcal{S}\}$ provides a natural way out of this
situation. Indeed at finite $\beta$ each set $i$ visits different clusters $s$, and we can
define temperature-dependent parameters $g_{s,i}(\beta)$. These parameters can be measured in a
MC simulation and provide us with a measure of the strength of the correlation between set $i$
and cluster $s$.

These parameters and Eq. (17) also allow us to generate synthetic data sets $\xi_i(d)$, whose
statistical properties can be compared to those of the original data set. We make this
comparison using the spectral properties of the correlation matrix. In other words, with
Eq. (1) and the synthetic $\xi_i(d)$ we compute a "$\beta$-synthetic" correlation matrix; we
determine the spectrum of eigenvalues and compare it to that of the original matrix. The
parameter $\beta$ can be tuned in order to get the best fit.

This procedure was carried out for the S&P500 data set. The eigenvalue spectra of the two
matrices are compared in Fig. 5 for $\beta = 48$. The value of $\beta$ was chosen by visual
inspection as that giving the best fit. The curves are remarkably close, suggesting that
Eq. (17) provides a good statistical description of the correlations among assets.

Once the value $\beta^*$ which gives the best fit is found, we can compute the noise undressed
correlation matrix

$$C^*_{ij}(\infty) = \frac{\sum_s \sqrt{g^*_{s,i}\,g^*_{s,j}} + \delta_{ij}}{\sqrt{\left(1+\sum_s g^*_{s,i}\right)\left(1+\sum_s g^*_{s,j}\right)}} \qquad (19)$$

from the parameters $g^*_{s,i} = g_{s,i}(\beta^*)$. This is the correlation matrix of a
synthetic data set obtained from Eq. (17) with $D \to \infty$. Fig. 5 shows the eigenvalue
distribution of the noise undressed matrix $C^*_{ij}(\infty)$. This allows one to appreciate
the effect of noise dressing. As expected, noise mainly affects small eigenvalues.
FIG. 5. Comparison of the spectrum of the S&P500 correlation matrix (full line •) with
noise-dressed (dotted □) and bare (dashed +) correlation matrices generated by Eq. (17).



VII. DISCUSSION AND CONCLUSIONS 



The applicability of the method can be extended considerably to a generic data set
$\{\vec x_i\}_{i=1}^N$. Here $\vec x_i$ need not be a time series. The distribution of
$x_i(d)$ need not be Gaussian and it does not even need to be the same across $i$. For example,
$x_i(d)$ may be the measure of the $d$-th feature of the $i$-th object or the concentration of
species $i$ in the $d$-th sample of an experiment. The idea is to map the data set $\vec x_i$
into a Gaussian time series $\vec\xi_i$ to which we apply Eq. (2). The mapping results from
requiring that non-parametric cross-correlations are preserved,
$\tau^x_{ij} = \tau^\xi_{ij}$. To do this in practice we compute Kendall's $\tau$ [13] for the
$x_i$ data sets:
$\tau^x_{ij} = \langle \mathrm{sign}[x_i(d) - x_i(d')]\,\mathrm{sign}[x_j(d) - x_j(d')] \rangle_{d<d'}$.
For two Gaussian time series with correlation $C_{ij}$ one can compute analytically
$\tau^\xi_{ij}$ in the limit $D \to \infty$. This leads to the relation

$$C_{ij} = \frac{\tan(\pi\tau_{ij}/2)}{\sqrt{1+\tan^2(\pi\tau_{ij}/2)}} \qquad (20)$$

between Gaussian and non-parametric correlations. This equation allows us to translate
non-parametric correlations into Gaussian correlations. From these one can build $H_c$ of
Eq. (8) and study the clustering properties.
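The translation from non-parametric to Gaussian correlations can be sketched as follows. The function names are illustrative; the conversion uses the identity $\tan(x)/\sqrt{1+\tan^2(x)} = \sin(x)$, valid for $|x| < \pi/2$, so that Eq. (20) is evaluated as $\sin(\pi\tau_{ij}/2)$. The $O(D^2)$ pair loop is the simplest possible choice and is an assumption of this sketch rather than a recommendation for long series.

```python
import math

def sign(x):
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)

def kendall_tau(x, y):
    """Average over pairs d < d' of sign[x(d) - x(d')] * sign[y(d) - y(d')]."""
    D = len(x)
    total = sum(sign(x[a] - x[b]) * sign(y[a] - y[b])
                for a in range(D) for b in range(a + 1, D))
    return total / (D * (D - 1) / 2)

def gaussian_correlation(tau):
    """Gaussian correlation with the same Kendall tau, Eq. (20) written as sin(pi*tau/2)."""
    return math.sin(math.pi * tau / 2.0)
```

Since Kendall's $\tau$ depends only on the ordering of the values, any monotonically increasing transformation of a series leaves the resulting $C_{ij}$ unchanged, which is what makes the mapping insensitive to the marginal distributions.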

This procedure has been tested for the S&P500 data set, for which it is known that $\xi_i(d)$
has non-Gaussian statistics [11]. We have found indistinguishable results, which indicates that
the deviations from Gaussian behavior have little or no effect on the results. We expect that
this approach breaks down when the marginal distribution of $\xi_i(d)$ is such that the second
moment is not defined. In that case $C_{ij}$ computed from $\tau_{ij}$ and Eq. (20) can even
fail to be positive definite.

With respect to earlier Potts-based clustering methods, our approach does not need any
assumption on the form of the Hamiltonian. As input, the method only needs the correlation
matrix $C_{ij}$ (or $\tau_{ij}$). The range of interactions is set by the correlations
themselves. Indeed our method predicts a non-trivial ground state $\mathcal{S}_0$ which is not,
in general, the ferromagnetic one.

For small $D$, the local interaction of earlier Potts-model approaches may well be more
efficient in capturing the structure of data. Our method is most useful in cases where
$D \sim N \gg 1$. These ideas can clearly be extended to models of correlations different from
Eq. (2), as shown, for example, in the Appendix.

We acknowledge R. Zecchina, R. Pastor-Satorras and 
D. Vergni for interesting discussions and R. N. Mantegna 
for providing the S&P500 data. 



APPENDIX A: 

We define here the explicative factor model for stock returns (also known as multi index
model):

$$\xi_i(t) = \frac{\vec v_i \cdot \vec\eta(t) + \epsilon_i(t)}{\sqrt{1+v_i^2}}, \qquad (A1)$$

where the $\vec v_i$ are $L$-dimensional vectors, $v_i^2 = \vec v_i\cdot\vec v_i$, and
$\vec\eta(t)$ is an $L$-dimensional Gaussian random vector with
$\langle \eta^a(t) \rangle = 0$ and
$\langle \eta^a(t)\,\eta^b(t') \rangle = \delta_{a,b}\,\delta_{t,t'}$.

The idea is that there are $L$ explicative factors $\eta^1(t), \ldots, \eta^L(t)$ which
describe the fluctuations of each stock price. This model is different from the one we
considered in the text in that each time series is coupled to all others with a different
strength. This can be easily understood by observing that the model (2) can be cast in the form
of an explicative factor model with $v_i^a = \sqrt{g_{s_i}}\,\delta_{a,s_i}$. This is a rather
particular form of Eq. (A1). We observe, however, that $L$ must be much smaller than $N$ in
order to avoid problems of overfitting with Eq. (A1), while Eq. (2) requires $L \approx N$.

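As we did in Sect. II, we look at the probability of observing the time series $\xi_i(t)$ given
the model Eq. (A1) and the parameters $\vec v_i$:

$$P\{\xi|v\} = \prod_{t=1}^{D}\left\langle \prod_{i=1}^{N} \delta\!\left(\xi_i(t) - \frac{\vec v_i\cdot\vec\eta(t)+\epsilon_i(t)}{\sqrt{1+v_i^2}}\right)\right\rangle. \qquad (A2)$$

After taking the average over the Gaussian variables one obtains, as in the text,
$P\{\xi|v\} \propto e^{-DH\{v\}}$ with

$$H\{v\} = \frac{1}{2}\sum_i\left[(1+v_i^2) - \log(1+v_i^2)\right] + \frac{1}{2}\mathrm{Tr}\log(1+\omega) - \frac{1}{2}\mathrm{Tr}\,\frac{\chi}{1+\omega}, \qquad (A3)$$

where we have defined the $L \times L$ matrices

$$\omega^{ab} = \sum_{i=1}^{N} v_i^a v_i^b, \qquad (A4)$$

$$\chi^{ab} = \sum_{i,j=1}^{N} \sqrt{1+v_i^2}\;v_i^a\,C_{ij}\,v_j^b\,\sqrt{1+v_j^2}, \qquad (A5)$$

and $C_{ij}$ is defined in Eq. (1). We note that the second term in Eq. (A3) is sub-extensive,
and could be neglected; nevertheless, in the presence of the matrix $\chi$ a Monte Carlo
simulation becomes excessively time consuming, since a change in $\vec v_k$ requires order $N$
operations to compute the new matrix. This may considerably limit the practical applicability
of this method.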